The Chi Square Test and Measures of Association

class: center, middle, inverse, title-slide

.title[
# The Chi Square Test and Measures of Association
]
.subtitle[
## EDP 613
]
.author[
### Week 12
]

---


# A Note About The Slides

Equations currently do not render properly in Firefox. Other browsers, such as Chrome and Safari, display them correctly.

---

# Independence

Two variables that have no association with each other are **statistically independent**.

---

# Frequencies

- **expected frequencies**

> written `$f_e$`
  
--
  
  > what you would *expect* in a bivariate table if two variables were statistically independent 
  
--
  
  > only assumption: the null hypothesis is true
  
--

> calculated by `$$f_e = \dfrac{\text{column marginal}\cdot\text{row marginal}}{\text{total sample size}}$$`
  
--

- **observed frequencies**

> written `$f_o$`
  
--

> what you actually *observe* in a bivariate table built from your data
  
--

> counted from your data or given in the problem
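
---

# Frequencies: A Sketch in Code

The expected-frequency formula can be sketched in a few lines of Python. This is only an illustration: the function name `expected_frequencies` and the 2 x 2 table of counts are made up for this slide and are not course data.

```python
def expected_frequencies(observed):
    """Return expected counts f_e for a two-way table of observed counts.

    Each cell is (row marginal * column marginal) / total sample size.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical observed counts (made-up numbers, for illustration only)
observed = [[30, 20],
            [10, 40]]

expected = expected_frequencies(observed)
print(expected)
```

Here both row marginals are 50, the column marginals are 40 and 60, and the total is 100, so every cell in the first column has an expected count of 20 and every cell in the second column has 30.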

---

# Chi-Square Test

> written `$\chi^2$`

> assumes *random sampling*

> an inferential test of whether two variables are significantly related

> calculated by `$$\chi^2 =\sum\dfrac{(f_o-f_e)^2}{f_e}$$`

> with `$$df = (r-1)(c-1)$$` where `$r$` and `$c$` are the numbers of rows and columns in the bivariate table
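
---

# Chi-Square Test: A Sketch in Code

A minimal Python sketch of the statistic and its degrees of freedom for a bivariate table. The function name `chi_square` and the counts are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def chi_square(observed):
    """Return (chi-square statistic, df) for a two-way table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            # Expected count under independence for cell (i, j)
            f_e = row_totals[i] * col_totals[j] / n
            statistic += (f_o - f_e) ** 2 / f_e

    # df = (r - 1)(c - 1)
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Hypothetical observed counts (made-up numbers, for illustration only)
observed = [[30, 20],
            [10, 40]]

statistic, df = chi_square(observed)
print(round(statistic, 3), df)
```

For this table the four cell contributions are 5, 10/3, 5, and 10/3, so the statistic is 50/3 with one degree of freedom.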

---

### Example: Social Media

The number of respondents in each age group who use at least one social media outlet is given below.
.footnote[Source: [*Pew Research Center: Social Media Fact Sheet*](https://www.pewresearch.org/internet/fact-sheet/social-media/)]

.pull-left[
<center>In 2011:</center> 
<br>
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Age </th>
   <th style="text-align:center;"> Responses </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 18 - 29 </td>
   <td style="text-align:center;"> 820 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 30 - 49 </td>
   <td style="text-align:center;"> 590 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 50 - 64 </td>
   <td style="text-align:center;"> 360 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 65+ </td>
   <td style="text-align:center;"> 120 </td>
  </tr>
</tbody>
</table>
]

.pull-right[
<center>In 2021:</center> 
<br>
<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Age </th>
   <th style="text-align:center;"> Responses </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 18 - 29 </td>
   <td style="text-align:center;"> 840 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 30 - 49 </td>
   <td style="text-align:center;"> 810 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 50 - 64 </td>
   <td style="text-align:center;"> 730 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 65+ </td>
   <td style="text-align:center;"> 450 </td>
  </tr>
</tbody>
</table>
]

a. Test the assumption that *users are equally likely* to be in each of the four age groups listed.

b. Which age group contributes the largest amount to the test statistic?

---

### Example: Solution for 2011

a. We have

`$$H_0: \text{Users are equally likely to be in each of the four groups listed}$$`
--

`$$H_1: \text{Users are NOT equally likely to be in each of the four groups listed}$$`

> Step 1: Find `$N$`
<br>
<br>

.pull-left[
We have `$820+590+360+120=1890$` total responses
]

.pull-right[
If the distribution were uniform across all four categories, we would expect each category to have `$1890/4\approx472$` respondents
]

---

> Step 2: Calculate the `$\chi^2$` statistic

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Age </th>
   <th style="text-align:center;"> Responses </th>
   <th style="text-align:center;"> `$\frac{(f_o-f_e)^2}{f_e}$` </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 18 - 29 </td>
   <td style="text-align:center;"> 820 </td>
   <td style="text-align:center;"> `$\frac{(820-472)^2}{472} \approx 256.576$` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 30 - 49 </td>
   <td style="text-align:center;"> 590 </td>
   <td style="text-align:center;"> `$\frac{(590-472)^2}{472} \approx 29.500$` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 50 - 64 </td>
   <td style="text-align:center;"> 360 </td>
   <td style="text-align:center;"> `$\frac{(360-472)^2}{472} \approx 26.576$` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 65+ </td>
   <td style="text-align:center;"> 120 </td>
   <td style="text-align:center;"> `$\frac{(120-472)^2}{472} \approx 262.508$` </td>
  </tr>
</tbody>
</table>

with the total `$$256.576 + 29.500 + 26.576 + 262.508 = 575.160$$`

and, with `$k = 4$` categories, `$$df = k-1 = 4-1 = 3$$`

---

> Step 3: Make a Decision

In Appendix D

- Look at `$df = 3$`
 
--
 
  - `$\chi^2=575.160$` exceeds the largest critical value listed for `$df = 3$`, so `$p<0.001$`
 
--

- We reject `$H_0$` implying that
  
--
  
<center>  
 <i> respondents are not equally likely to be in each of the four age ranges listed</i>
</center>
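
---

### Example: Checking the 2011 Decision

As a cross-check on the table lookup, the p-value can be computed directly. The sketch below relies on a closed form for the upper tail of the chi-square distribution that holds only when `$df = 3$`. The function name `chi_square_sf_df3` is made up for this slide, and the expected count uses the rounded value of 472 from Step 1.

```python
import math


def chi_square_sf_df3(x):
    """Upper-tail probability P(X >= x) for a chi-square variable with 3 df.

    Closed form valid for df = 3 only:
    erfc(sqrt(x / 2)) + sqrt(2x / pi) * exp(-x / 2)
    """
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


observed_2011 = [820, 590, 360, 120]
expected = 472  # 1890 / 4 = 472.5, rounded as in Step 1

statistic = sum((f_o - expected) ** 2 / expected for f_o in observed_2011)
p_value = chi_square_sf_df3(statistic)

print(round(statistic, 3))
print(p_value < 0.001)
```

A p-value below 0.001 agrees with the decision to reject `$H_0$`.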

---

b.
  
  - 65+ contributes the greatest amount to the sum for the test statistic
  
--

- The observed count is much smaller than expected
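
---

### Example: Contributions by Age Group

To see which category drives the test statistic, each term of the sum can be computed separately. This is a minimal Python sketch: the variable names are illustrative, and the expected count is the rounded value of 472 from Step 1.

```python
age_groups = ["18 - 29", "30 - 49", "50 - 64", "65+"]
observed_2011 = [820, 590, 360, 120]
expected = 472  # 1890 / 4 = 472.5, rounded as in Step 1

# One term of the chi-square sum per age group
contributions = {
    group: (f_o - expected) ** 2 / expected
    for group, f_o in zip(age_groups, observed_2011)
}

largest = max(contributions, key=contributions.get)

for group, value in contributions.items():
    print(f"{group}: {value:.3f}")
print("Largest contribution:", largest)
```

The 65+ term is the largest, with the 18 - 29 term close behind. The youngest group is over-represented and the oldest group is under-represented relative to a uniform split.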

---

### Example: Solution for 2021

We have

`$$H_0: \text{Users are equally likely to be in each of the four groups listed}$$`

`$$H_1: \text{Users are NOT equally likely to be in each of the four groups listed}$$`

---

> Step 1: Find `$N$`
<br>
<br>

.pull-left[
We have `$840+810+730+450=2830$` total responses
]

.pull-right[
If the distribution were uniform across all four categories, we would expect each category to have `$2830/4\approx707$` respondents
]

---

> Step 2: Calculate the `$\chi^2$` statistic

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> Age </th>
   <th style="text-align:center;"> Responses </th>
   <th style="text-align:center;"> `$\frac{(f_o-f_e)^2}{f_e}$` </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> 18 - 29 </td>
   <td style="text-align:center;"> 840 </td>
   <td style="text-align:center;"> `$\frac{(840-707)^2}{707} \approx 25.020$` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 30 - 49 </td>
   <td style="text-align:center;"> 810 </td>
   <td style="text-align:center;"> `$\frac{(810-707)^2}{707} \approx 15.006$` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 50 - 64 </td>
   <td style="text-align:center;"> 730 </td>
   <td style="text-align:center;"> `$\frac{(730-707)^2}{707} \approx 0.748$` </td>
  </tr>
  <tr>
   <td style="text-align:left;"> 65+ </td>
   <td style="text-align:center;"> 450 </td>
   <td style="text-align:center;"> `$\frac{(450-707)^2}{707} \approx 93.421$` </td>
  </tr>
</tbody>
</table>

with the total `$$25.020+15.006+0.748+93.421 = 134.195$$`

and, with `$k = 4$` categories, `$$df = k-1 = 4-1 = 3$$`

---

> Step 3: Make a Decision

In Appendix D

- Look at `$df = 3$`
 
  - `$\chi^2=134.195$` exceeds the largest critical value listed for `$df = 3$`, so `$p<0.001$`
 
--

- We reject `$H_0$` implying that
  
<center>  
 <i> respondents are not equally likely to be in each of the four age ranges listed</i>
</center>
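
---

### Example: Checking the 2021 Decision

The table lookup can be mimicked in Python by comparing the statistic with critical values. The values in `CRITICAL_DF3` are the standard chi-square critical values for `$df = 3$`; Appendix D is assumed to list the same ones. The function name `goodness_of_fit` is made up for this slide, and the expected count is the rounded value of 707 from Step 1.

```python
# Standard chi-square critical values for df = 3, keyed by significance level
CRITICAL_DF3 = {0.05: 7.815, 0.01: 11.345, 0.001: 16.266}


def goodness_of_fit(observed, expected):
    """Return (statistic, df) when every category has the same expected count."""
    statistic = sum((f_o - expected) ** 2 / expected for f_o in observed)
    df = len(observed) - 1
    return statistic, df


observed_2021 = [840, 810, 730, 450]
statistic, df = goodness_of_fit(observed_2021, expected=707)

# Reject H0 at a given level when the statistic exceeds the critical value
reject = {alpha: statistic > cv for alpha, cv in CRITICAL_DF3.items()}

print(round(statistic, 3), df)
print(reject)
```

The statistic exceeds even the 0.001 critical value, which matches the conclusion `$p<0.001$`.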

---

## That's it. Take a break before our R session!